Skip to content

perf: Remove allocation overhead when retrieving labels for Waypoints() and PolygonTriggers()#2761

Open
Mauller wants to merge 1 commit into
TheSuperHackers:mainfrom
Mauller:fix-waypoint+trigger-label-overhead
Open

perf: Remove allocation overhead when retrieving labels for Waypoints() and PolygonTriggers()#2761
Mauller wants to merge 1 commit into
TheSuperHackers:mainfrom
Mauller:fix-waypoint+trigger-label-overhead

Conversation

@Mauller
Copy link
Copy Markdown

@Mauller Mauller commented May 31, 2026

This PR reduces the allocation overhead seen when waypoint path labels and polygontrigger names are retrieved.

In mod maps with a lot of path scripted ai, such as Cobal Rush, waypoint labels are retrieved and compared every frame.
This will occur for every AI unit that is pathing along the waypoint. Polygon triggers also have a similar issue where they can be repeatedly compared every frame per scripted unit.

The following shows the side effects of this change for waypoint path labels.

Before:
image

After:
image

In game the most significant observation was that the first wave in Cobalt Rush went from stuttering to smooth.
The average FPS also significantly increased when observing the first wave and the camera also stopped stuttering.

The above flame graphs were captured in a debug build, but the stuttering still occurs in a release build and is alleviated with the same fix.

@Mauller Mauller self-assigned this May 31, 2026
@Mauller Mauller added Performance Is a performance concern Fix Is fixing something, but is not user facing Gen Relates to Generals ZH Relates to Zero Hour labels May 31, 2026
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 31, 2026

Greptile Summary

This PR changes four getter methods across both the Generals and GeneralsMD codebases to return const AsciiString& instead of AsciiString by value, eliminating per-call heap allocations on hot paths that are exercised every frame for each AI unit following scripted waypoint paths or entering polygon trigger areas.

  • Waypoint::getPathLabel1/2/3 (TerrainLogic.h, both trees): these are called repeatedly in isPurposeOfPath and ScriptConditions comparison loops that run per-unit per-frame in maps with heavy AI scripting.
  • PolygonTrigger::getTriggerName (PolygonTrigger.h, both trees): called each frame in AI guard/patrol states for each unit guarding a trigger area.
  • All existing call sites either immediately compare or .str() the result, or copy it into a local AsciiString by value — no dangling-reference risk is introduced.

Confidence Score: 5/5

Safe to merge — the change is a narrow, well-understood API signature tightening with no behavioral impact.

All four changed getters return a reference to a long-lived member field, and every existing call site either immediately uses the result in a comparison or copies it into a local AsciiString by value. No call site stores the reference beyond the expression where it is used, so no dangling-reference scenario exists in the current codebase. The Generals and GeneralsMD trees are updated symmetrically.

No files require special attention. All four header changes are consistent and low-risk.

Important Files Changed

Filename Overview
Generals/Code/GameEngine/Include/GameLogic/TerrainLogic.h Changes getPathLabel1/2/3 return type from AsciiString (by value) to const AsciiString& to eliminate per-call heap allocations; all call sites safely copy or compare the result without storing the reference.
Generals/Code/GameEngine/Include/GameLogic/PolygonTrigger.h Changes getTriggerName return type from AsciiString (by value) to const AsciiString& to eliminate per-call heap allocation; existing callers all assign to AsciiString by value or consume the result inline.
GeneralsMD/Code/GameEngine/Include/GameLogic/TerrainLogic.h Zero Hour mirror of the Generals TerrainLogic.h change; identical update to getPathLabel1/2/3 return types. Same safety assessment applies.
GeneralsMD/Code/GameEngine/Include/GameLogic/PolygonTrigger.h Zero Hour mirror of the Generals PolygonTrigger.h change; identical update to getTriggerName return type. Same safety assessment applies.

Sequence Diagram

sequenceDiagram
    participant AI as AI Unit (per frame)
    participant SC as ScriptConditions / isPurposeOfPath
    participant WP as Waypoint
    participant PT as PolygonTrigger

    AI->>SC: check waypoint path label
    SC->>WP: getPathLabel1() [before: AsciiString copy]
    WP-->>SC: "const AsciiString& (no alloc)"
    SC->>WP: getPathLabel2()
    WP-->>SC: "const AsciiString& (no alloc)"
    SC->>WP: getPathLabel3()
    WP-->>SC: "const AsciiString& (no alloc)"

    AI->>SC: check polygon trigger area
    SC->>PT: getTriggerName() [before: AsciiString copy]
    PT-->>SC: "const AsciiString& (no alloc)"
Loading

Reviews (1): Last reviewed commit: "perf: Remove allocation overhead when re..." | Re-trigger Greptile

Copy link
Copy Markdown

@xezon xezon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this improve performance? Because of the global ref counter lock?

PolygonTrigger *getNext() {return m_nextPolygonTrigger;}
const PolygonTrigger *getNext() const {return m_nextPolygonTrigger;}
AsciiString getTriggerName() const {return m_triggerName;} ///< Gets the trigger name.
const AsciiString& getTriggerName() const {return m_triggerName;} ///< Gets the trigger name.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not allocate. But it does touch the string ref counter.

Callers should also use references then, for example:

PolygonTrigger *TerrainLogic::getTriggerAreaByName( AsciiString name )
{
	for (PolygonTrigger* pTrig = PolygonTrigger::getFirstPolygonTrigger(); pTrig; pTrig = pTrig->getNext()) {
		AsciiString trigName = pTrig->getTriggerName(); // <---- can take const ref then
		if (name == trigName)
			return pTrig;
	}
	return nullptr;
}

@Mauller
Copy link
Copy Markdown
Author

Mauller commented May 31, 2026

Why does this improve performance? Because of the global ref counter lock?

This is where the problem lies, a copy is passed into compareNoCase which causes an allocation.

Waypoint *TerrainLogic::getClosestWaypointOnPath( const Coord3D *pos, AsciiString label )
{
	Real distSqr = 0;
	Waypoint *pClosestWay = nullptr;
	if (label.isEmpty()) {
		DEBUG_LOG(("***Warning - asking for empty path label."));
		return nullptr;
	}

	for( Waypoint *way = m_waypointListHead; way; way = way->getNext() ) {
		Bool match = false;
		if (label.compareNoCase(way->getPathLabel1())==0) match = true;
		if (label.compareNoCase(way->getPathLabel2())==0) match = true;
		if (label.compareNoCase(way->getPathLabel3())==0) match = true;
		if (match) {
			Coord3D curPos = *way->getLocation();
			Real newDistSqr = (curPos.x-pos->x)*(curPos.x-pos->x) + (curPos.y-pos->y)*(curPos.y-pos->y);
			if (pClosestWay==nullptr) {
				pClosestWay = way;
				distSqr = newDistSqr;
			} else if (newDistSqr < distSqr) {
				pClosestWay = way;
				distSqr = newDistSqr;
			}
		}
	}

	return pClosestWay;
}

And this is done hundreds to thousands of times per frame if there are a lot of path scripted units in a scene.

It is a night and day difference in wave 1 of cobalt rush when a reference is passed from the function instead.

CompareNoCase already takes a reference as well.

@xezon
Copy link
Copy Markdown

xezon commented May 31, 2026

I still don't see it.

Can you explain how it allocates?

if (label.compareNoCase(way->getPathLabel1())==0) match = true;

@Mauller
Copy link
Copy Markdown
Author

Mauller commented May 31, 2026

I still don't see it.

Can you explain how it allocates?

if (label.compareNoCase(way->getPathLabel1())==0) match = true;

Acsii strings are being allocated and deleted, you can see it when comparing the flame graphs since in the non reference variant there are calls to delete ascii strings and to release buffers.

image

@Caball009
Copy link
Copy Markdown

FWIW make sure to measure / benchmark with release builds.

@Mauller
Copy link
Copy Markdown
Author

Mauller commented May 31, 2026

FWIW make sure to measure / benchmark with release builds.

i know, this does affect release too.

@Mauller
Copy link
Copy Markdown
Author

Mauller commented May 31, 2026

With the fix in place, i guess the performance also comes from the critical section is not being hit.

image

@xezon
Copy link
Copy Markdown

xezon commented May 31, 2026

Acsii strings are being allocated and deleted, you can see it when comparing the flame graphs since in the non reference variant there are calls to delete ascii strings and to release buffers.

It is not allocating/deallocating. The cost comes from the critical section, which was added in #799 last year.

Better replace the critical section with atomic counting.

image

@Mauller
Copy link
Copy Markdown
Author

Mauller commented Jun 1, 2026

Acsii strings are being allocated and deleted, you can see it when comparing the flame graphs since in the non reference variant there are calls to delete ascii strings and to release buffers.

It is not allocating/deallocating. The cost comes from the critical section, which was added in #799 last year.

Better replace the critical section with atomic counting.

There was always a critical section, even in the original code, they just called it a "fast critical section".
I just simplified the code to match the original Generals implementation.

There still needs to be a lock as it is protecting the allocated data and not just the reference counter.
I can see if a mutex per string helps, otherwise the reference to the string works perfectly fine and doesn't hit the critical section in the hot path.

@xezon
Copy link
Copy Markdown

xezon commented Jun 1, 2026

There was always a critical section, even in the original code, they just called it a "fast critical section". I just simplified the code to match the original Generals implementation.

There still needs to be a lock as it is protecting the allocated data and not just the reference counter. I can see if a mutex per string helps, otherwise the reference to the string works perfectly fine and doesn't hit the critical section in the hot path.

This is not quite right. EA swapped the Critical Section in AsciiString for Atomic Counting. I expect this was deliberate, to tackle the same kind of performance culprits that you are observing now.

inline AsciiString::AsciiString(const AsciiString& stringSrc) : m_data(stringSrc.m_data)
{
  // don't need this if we're using InterlockedIncrement
  // FastCriticalSectionClass::LockClass lock(TheAsciiStringCriticalSection);
	if (m_data)
		// ++m_data->m_refCount;
    // yes, I know it's not a DWord but we're incrementing so we're safe
    InterlockedIncrement((long *)&m_data->m_refCount); // <----- Atomic counter
	validate();
}

inline void AsciiString::releaseBuffer()
{
  // FastCriticalSectionClass::LockClass lock(TheAsciiStringCriticalSection);

	validate();
	if (m_data)
	{
    InterlockedDecrement((long *)&m_data->m_refCount); // <----- Atomic counter
		if (!m_data->m_refCount)
			freeBytes();
		m_data = 0;
	}
	validate();
}

Using atomic counter is the right thing to do.

@Mauller
Copy link
Copy Markdown
Author

Mauller commented Jun 1, 2026

There was always a critical section, even in the original code, they just called it a "fast critical section". I just simplified the code to match the original Generals implementation.
There still needs to be a lock as it is protecting the allocated data and not just the reference counter. I can see if a mutex per string helps, otherwise the reference to the string works perfectly fine and doesn't hit the critical section in the hot path.

This is not quite right. EA swapped the Critical Section in AsciiString for Atomic Counting. I expect this was deliberate, to tackle the same kind of performance culprits that you are observing now.

inline AsciiString::AsciiString(const AsciiString& stringSrc) : m_data(stringSrc.m_data)
{
  // don't need this if we're using InterlockedIncrement
  // FastCriticalSectionClass::LockClass lock(TheAsciiStringCriticalSection);
	if (m_data)
		// ++m_data->m_refCount;
    // yes, I know it's not a DWord but we're incrementing so we're safe
    InterlockedIncrement((long *)&m_data->m_refCount); // <----- Atomic counter
	validate();
}

inline void AsciiString::releaseBuffer()
{
  // FastCriticalSectionClass::LockClass lock(TheAsciiStringCriticalSection);

	validate();
	if (m_data)
	{
    InterlockedDecrement((long *)&m_data->m_refCount); // <----- Atomic counter
		if (!m_data->m_refCount)
			freeBytes();
		m_data = 0;
	}
	validate();
}

Using atomic counter is the right thing to do.

I was probably remembering the fast critical section bit that was not used, the thread safety is still a problem though which the atomic counters do not provide.

As far as i am aware ascii strings are used between threads in the audio system.

Might be wrong though, need to check.

The ref counter could always be placed within the Ascii/Unicode String instead of within the data that gets allocated to it.

Edit - actually the ref conter cannot be on the ascii/Unicode string object otherwise there could be out of sync copies of it.

@xezon
Copy link
Copy Markdown

xezon commented Jun 1, 2026

What is the concern regarding thread safety? The only write shared access point should be the counter, not the string data. The Atomic Counter is totally sufficient. The only problem currently is that VC6 only has 4 byte atomic counter, but we have a 2 byte counter here. I think 2 byte counter is available in assembler though.

@Mauller
Copy link
Copy Markdown
Author

Mauller commented Jun 1, 2026

What is the concern regarding thread safety? The only write shared access point should be the counter, not the string data. The Atomic Counter is totally sufficient. The only problem currently is that VC6 only has 4 byte atomic counter, but we have a 2 byte counter here. I think 2 byte counter is available in assembler though.

The reference counter is part of the allocated data object for the string, so if one thread releases the data as another one accesses it, then you have a race condition.

@xezon
Copy link
Copy Markdown

xezon commented Jun 1, 2026

The reference counter is part of the allocated data object for the string, so if one thread releases the data as another one accesses it, then you have a race condition.

This is impossible. Ref counter will not reach 0 when 2 threads hold a reference.

@Mauller
Copy link
Copy Markdown
Author

Mauller commented Jun 1, 2026

The reference counter is part of the allocated data object for the string, so if one thread releases the data as another one accesses it, then you have a race condition.

This is impossible. Ref counter will not reach 0 when 2 threads hold a reference.

One thread might be trying to gain a reference to the object just as another is releasing it, that's the situation i am considering.

@xezon
Copy link
Copy Markdown

xezon commented Jun 1, 2026

One thread might be trying to gain a reference to the object just as another is releasing it, that's the situation i am considering.

This is also impossible. If the unique owner is releasing it, there is will be no one else having any opportunity to take ownership. The Atomic counter approach is rock solid - do not worry.

@Mauller
Copy link
Copy Markdown
Author

Mauller commented Jun 1, 2026

One thread might be trying to gain a reference to the object just as another is releasing it, that's the situation i am considering.

This is also impossible. If the sole owner is releasing it, there is will be no one else having any opportunity to take take ownership. The Atomic counter approach is rock solid - do not worry.

How is it impossible?

if the original thread that owned the string deletes it just at the point that another thread goes to take a reference to it, you can have a race condition on checking if m_data exists. The original owning thread could clear the data and the new thread could just be behind it, pass the m_data check then fault on incrementing the counter as m_data is no longer valid.

Unless i am missing something, since the counter is part of the allocated data, i don't see how you cannot end up with a race condition in some circumstances.

@xezon
Copy link
Copy Markdown

xezon commented Jun 1, 2026

You are right that is a data race. The same principles as for std::shared_ptr apply. This is expected and okay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Fix Is fixing something, but is not user facing Gen Relates to Generals Performance Is a performance concern ZH Relates to Zero Hour

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants